{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
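    "#### What is a web crawler\n",
    "\n",
    "A web crawler is a program that automatically fetches information from the internet and keeps the parts that are valuable to us.\n",
    "\n",
    "Put simply, a crawler is code that collects web page content automatically; you can think of it as a replacement for tedious manual copy and paste.\n",
    "\n",
    "A crawler can only target pages that you are already able to view.\n",
    "\n",
    "\n",
    "> Request data (HTML pages) from a web server programmatically, then parse the HTML and extract the data you want.\n",
    "\n",
    "\n",
    "![](assets/spider_01.jpg)\n",
    "\n",
    "The process boils down to four steps:\n",
    "\n",
    "- Fetch the HTML for a URL\n",
    "- Parse the HTML and extract the target information\n",
    "- Store the data\n",
    "- Go back to the first step\n",
    "\n",
    "This touches on databases, web servers, the HTTP protocol, HTML, data science, network security, image processing, and many other topics.\n"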
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is an example: a crawler built with BeautifulSoup. The complete code for scraping the headlines is shown below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Vision Transformer这两年\n",
      "​Meta发布 “科研者的福音”，上线仅三天被骂到撤退\n",
      "这一秒，困扰了程序员 50 年\n",
      "安卓开发者的跨平台Flutter or Compose ？\n",
      "嗨Jina，帮我画一幅高山流水图\n",
      "Spring Boot 3.0 正式发布\n",
      "Spring Framework 6发布\n",
      "2022年五个微SaaS创富方向\n",
      "微信新增图片、视频拖动一键发送功能\n",
      "历史上的今天：中国顶级域名CN被注册\n"
     ]
    }
   ],
   "source": [
    "import requests  # HTTP request library\n",
    "from bs4 import BeautifulSoup  # HTML parsing library\n",
    "\n",
    "# Send a GET request to the URL\n",
    "r = requests.get('https://www.csdn.net/')\n",
    "\n",
    "# Parse the returned HTML\n",
    "soup = BeautifulSoup(r.text, 'html.parser')\n",
    "content_list = soup.find_all('div', attrs = {'class': 'headswiper-item'})\n",
    "\n",
    "for content in content_list:\n",
    "    print(content.a.text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
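    "### Key concepts\n",
    "\n",
    "Now that we know what a crawler is for, let's go through the concepts that come up most often.\n",
    "\n",
    "1.URL\n",
    "\n",
    "URL stands for Uniform Resource Locator. You can think of it as the link to a web page; for example, https://www.csdn.net/ above is a URL.\n",
    "\n",
    "When we talk about passing in a URL, we mean passing the page's link to a function. In the code above,\n",
    "\n",
    "```python\n",
    "r = requests.get('https://www.csdn.net/')\n",
    "```\n",
    "\n",
    "passes the URL to the request function.\n",
    "\n"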
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
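    "2.Web requests\n",
    "\n",
    "Before looking at web requests, it helps to know how information is exchanged when we browse the web. When you click a link in a browser, the browser sends a request to that page on your behalf. The server that hosts the page (think of it as a computer at CSDN that maintains the pages on the site) receives the request, decides that it is valid, and sends a response back to the browser. The browser then renders the response (that is, turns it into something people can read), and the result is the page you see. Sending a request and receiving a response is much like sending a WeChat message and getting a reply.\n",
    "\n",
    "Now we want code to imitate that mouse click. The `requests.get` call above has the code send the request to the page for you. If the request is judged valid, the server sends its information back, and that information is assigned to the variable `r`. So `r` now holds the information we want, including the titles we are trying to extract.\n",
    "\n",
    "We can run `print(r.text)` to see what it contains."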
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(r.text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
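    "The page source you see in the browser and `r.text` are essentially the same thing. `r.text` is simply a string that contains all the titles we just scraped, so we could extract them with a string-matching technique such as regular expressions.\n",
    "\n",
    "In short, all we need to do is extract information from the `r.text` string. Crawling really is that simple.\n",
    "\n",
    "So what is parsing about, and why did we use bs4 instead of a regular expression? Because it is more convenient. A regular expression works just as well; it is only more tedious and takes more code.\n",
    "\n",
    "For an introduction to regular expressions, see [正则表达式-简单爬虫的实例](https://blog.csdn.net/qq_42370313/article/details/101283345)"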
   ]
  },
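  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough sketch of the regular-expression approach mentioned above, the snippet below pulls link text out of a small HTML string. The sample markup and the pattern are assumptions made for illustration, loosely modeled on the `headswiper-item` class used earlier; they are not the real CSDN page source.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "# Hypothetical HTML fragment standing in for r.text\n",
    "sample_html = (\n",
    "    '<div class=\"headswiper-item\"><a href=\"/a/1\">First title</a></div>'\n",
    "    '<div class=\"headswiper-item\"><a href=\"/a/2\">Second title</a></div>'\n",
    "    '<div class=\"other\"><a href=\"/a/3\">Not a headline</a></div>'\n",
    ")\n",
    "\n",
    "# Capture the link text inside each div whose class is headswiper-item\n",
    "pattern = re.compile(r'<div class=\"headswiper-item\"><a[^>]*>(.*?)</a>')\n",
    "titles = pattern.findall(sample_html)\n",
    "print(titles)\n",
    "```\n",
    "\n",
    "The pattern only works while the markup keeps exactly this shape, which is the main reason a parser such as bs4 tends to be more convenient."
   ]
  },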
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
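    "3.Web page parsing\n",
    "\n",
    "Parsing a web page means extracting the data we want from what the server returned. Pulling out titles with a regular expression also counts as parsing.\n",
    "\n",
    "Most pages today are written in HTML, and HTML is highly regular. For instance, all the article titles we want share the same structure, meaning the text around each of them is very similar, which is what lets us collect them in bulk. For that reason people have written libraries dedicated to extracting specific text from HTML. These are the parsing libraries we usually talk about, such as bs4, lxml, and pyquery; you can simply treat them as string-processing tools.\n",
    "\n",
    "To understand parsing more clearly, we first need a rough grasp of how HTML is structured."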
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
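    "##### What is HTML\n",
    "\n",
    "\n",
    "HTML is the structure of a web page, the framework of the whole site. Anything wrapped in `<` and `>` is an HTML tag, and most tags come in pairs (an opening tag and a closing tag).\n",
    "\n",
    "Common tags:\n",
    "\n",
    "```text\n",
    "<html>..</html>     the enclosed elements form a web page\n",
    "<body>..</body>     content visible to the user\n",
    "<div>..</div>       a generic block container\n",
    "<p>..</p>           a paragraph\n",
    "<li>..</li>         a list item\n",
    "<img>               an image (no closing tag)\n",
    "<h1>..</h1>         a heading\n",
    "<a href=\"\">..</a>   a hyperlink\n",
    "```\n",
    "**CSS** defines the style, that is, how the page looks.\n",
    "\n",
    "**JavaScript** provides behavior. Interactive content and visual effects live in JavaScript, which describes what the site can do.\n",
    "\n",
    "If we compare a site to the human body, HTML is the skeleton, and it also determines where the mouth, eyes, and ears go. CSS is the detail of a person's appearance: the shape of the mouth, single or double eyelids, big or small eyes, dark or fair skin, and so on. JavaScript is a person's skills, such as dancing, singing, or playing an instrument.\n",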
    "\n",
    "\n",
    ">HTML (HyperText Markup Language) is the most basic building block of the Web. It defines the meaning and structure of web content. Other technologies besides HTML are generally used to describe a web page's appearance/presentation (CSS) or functionality/behavior (JavaScript).\n",
    "\n",
    "An HTML example"
   ]
  },
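  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the tag list above concrete, here is a small made-up HTML page together with a sketch that walks it using `HTMLParser` from Python's standard-library `html.parser` module. The page content and the `LinkCollector` class are hypothetical and exist only for illustration; bs4 offers a more convenient interface for the same job.\n",
    "\n",
    "```python\n",
    "from html.parser import HTMLParser\n",
    "\n",
    "# Hypothetical page: html > body > h1 and div > p and two links\n",
    "page = '''<html>\n",
    "<body>\n",
    "  <h1>Headlines</h1>\n",
    "  <div>\n",
    "    <p>Two links follow.</p>\n",
    "    <a href=\"/a/1\">First title</a>\n",
    "    <a href=\"/a/2\">Second title</a>\n",
    "  </div>\n",
    "</body>\n",
    "</html>'''\n",
    "\n",
    "\n",
    "class LinkCollector(HTMLParser):\n",
    "    # Collects an (href, text) pair for every <a> element that has an href\n",
    "    def __init__(self):\n",
    "        super().__init__()\n",
    "        self.links = []\n",
    "        self._href = None  # href of the <a> that is currently open, if any\n",
    "\n",
    "    def handle_starttag(self, tag, attrs):\n",
    "        if tag == 'a':\n",
    "            self._href = dict(attrs).get('href')\n",
    "\n",
    "    def handle_data(self, data):\n",
    "        if self._href is not None:\n",
    "            self.links.append((self._href, data.strip()))\n",
    "\n",
    "    def handle_endtag(self, tag):\n",
    "        if tag == 'a':\n",
    "            self._href = None\n",
    "\n",
    "\n",
    "collector = LinkCollector()\n",
    "collector.feed(page)\n",
    "print(collector.links)\n",
    "```\n",
    "\n",
    "The nesting of the tags is what lets the parser tell which text belongs to which link."
   ]
  },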
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
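    "#### On the legality of crawling\n",
    "\n",
    "Almost every website has a file named robots.txt, although some sites do not provide one. A site without robots.txt declares no crawling restrictions, so its pages that are not password-protected are generally treated as crawlable. If a site does have robots.txt, check whether it forbids visitors from fetching certain data.\n",
    "\n",
    "Example: https://xueqiu.com/robots.txt\n",
    "\n",
    "\n",
    "`User-agent: *` applies the rules that follow to all crawlers.\n",
    "\n",
    "`Disallow: /admin/` forbids crawling anything under the admin directory.\n",
    "\n",
    "`Disallow: /require/` forbids crawling anything under the require directory.\n",
    "\n",
    "`Disallow: /ABC/` forbids crawling anything under the ABC directory.\n",
    "\n",
    "`Disallow: /cgi-bin/*.htm` forbids access to every URL under /cgi-bin/ (including subdirectories) with the `.htm` suffix.\n",
    "\n",
    "`Disallow: /*?*` forbids access to every URL on the site that contains a question mark (?).\n",
    "\n",
    "`Disallow: /*.jpg$` forbids fetching any .jpg image on the site.\n",
    "\n",
    "`Disallow: /ab/adc.html` forbids crawling the file adc.html in the ab folder."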
   ]
  },
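  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Checking rules like these by hand is error-prone, so here is a minimal sketch that uses the standard-library `urllib.robotparser` module instead. The robots.txt content and the `example.com` URLs are made up for illustration, and `allowed` is a hypothetical helper; a real crawler would load the file from the target site.\n",
    "\n",
    "```python\n",
    "from urllib.robotparser import RobotFileParser\n",
    "\n",
    "# Hypothetical robots.txt content, not taken from any real site\n",
    "rules = '''User-agent: *\n",
    "Disallow: /admin/\n",
    "Disallow: /ab/adc.html\n",
    "'''\n",
    "\n",
    "parser = RobotFileParser()\n",
    "parser.parse(rules.splitlines())\n",
    "\n",
    "\n",
    "def allowed(url):\n",
    "    # True if the rules for 'User-agent: *' permit fetching this URL\n",
    "    return parser.can_fetch('*', url)\n",
    "\n",
    "\n",
    "print(allowed('https://example.com/index.html'))\n",
    "print(allowed('https://example.com/admin/users'))\n",
    "print(allowed('https://example.com/ab/adc.html'))\n",
    "```\n",
    "\n",
    "Note that `urllib.robotparser` matches rules by path prefix, so wildcard rules such as `/*?*` may not be evaluated the way search engines evaluate them."
   ]
  },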
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
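    "#### Requesting a site with the requests library\n",
    "Install the requests library:\n",
    "\n",
    "`pip install requests`\n",
    "\n",
    "\n",
    "#### How a crawler works\n",
    "\n",
    "A web request has two stages:\n",
    "- Request: every page shown to a user goes through this step, in which an access request is sent to the server.\n",
    "- Response: after receiving the request, the server checks that it is valid and sends the response content to the user (the client). The client receives the content and displays it, which completes the familiar page request.\n",
    "\n",
    "\n",
    "The two most common request methods are:\n",
    "- GET: the most common method, generally used to fetch or query resources. Most pages are loaded this way, and responses are usually fast.\n",
    "- POST: unlike GET, it sends parameters in the request body, typically as form data, so besides querying information it can also be used to modify it.\n",
    "\n",
    "So before writing a crawler, decide whom to send the request to and which method to use."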
   ]
  },
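  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The difference between GET and POST can be sketched without touching the network by building two request objects with the standard-library `urllib`. The `example.com` endpoint and the parameter names are placeholders chosen for illustration. With the requests library, the rough equivalents are `requests.get(url, params=params)` and `requests.post(url, data=params)`.\n",
    "\n",
    "```python\n",
    "from urllib.parse import urlencode\n",
    "from urllib.request import Request\n",
    "\n",
    "# Placeholder parameters for a hypothetical endpoint\n",
    "params = {'page': 1, 'page_size': 100}\n",
    "\n",
    "# GET: the parameters are appended to the URL as a query string\n",
    "get_req = Request('https://example.com/api/feed?' + urlencode(params))\n",
    "\n",
    "# POST: the parameters travel in the request body as form data\n",
    "post_req = Request('https://example.com/api/feed',\n",
    "                   data=urlencode(params).encode('utf-8'))\n",
    "\n",
    "# Nothing is sent here; we only inspect what would be sent\n",
    "print(get_req.get_method(), get_req.full_url)\n",
    "print(post_req.get_method(), post_req.data)\n",
    "```\n",
    "\n",
    "Both objects only describe a request; passing one to `urllib.request.urlopen` is what would actually send it."
   ]
  },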
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "### Fetch data with a GET request\n",
    "import requests        # import the requests package\n",
    "url = 'https://www.sina.com.cn/'\n",
    "res = requests.get(url)        # fetch the page with a GET request\n",
    "res.encoding = 'utf-8'\n",
    "print(res.text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
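    "#### Why set headers?\n",
    "\n",
    "When you request a page, the returned text sometimes contains messages such as \"Sorry, unable to access\". That means the site is blocking crawlers, and we need a way around its anti-crawling mechanism.\n",
    "\n",
    "Setting headers is one way to get past anti-crawling checks on requests: the request is dressed up so that the server treats it as a normal browser visit rather than a script.\n",
    "\n",
    "For pages with anti-crawling measures, set some headers so that the request imitates a browser visiting the site."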
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'User-Agent': 'python-requests/2.28.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "res.request.headers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
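    "#### Where to find headers\n",
    "\n",
    "In Chrome or Firefox, right-click the page, choose Inspect, and open the Network tab. Refresh the page, select the first entry, and look at its Request Headers.\n",
    "\n",
    "[Headers之User-Agent设置](https://zhuanlan.zhihu.com/p/35625779)\n",
    "\n",
    "Headers contain many fields; the ones used most often are User-Agent and Host, and they appear as key-value pairs. If passing User-Agent alone as a dictionary entry in `headers` is enough to get past the anti-crawling check, no other pairs are needed; otherwise, add more of the key-value pairs listed under Request Headers.\n",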
    "\n",
    "![avatar](assets/header.png)  "
   ]
  },
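  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The idea of starting with User-Agent alone and adding more pairs only when needed can be sketched as follows. `build_request` is a hypothetical helper written for this note, and it uses the standard-library `urllib.request.Request` so that the headers can be inspected without sending anything; with requests, the same dictionary would be passed as `headers=`.\n",
    "\n",
    "```python\n",
    "from urllib.request import Request\n",
    "\n",
    "BROWSER_UA = ('Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 '\n",
    "              '(KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36')\n",
    "\n",
    "\n",
    "def build_request(url, extra_headers=None):\n",
    "    # Always send a browser-like User-Agent; merge in extra pairs if given\n",
    "    headers = {'User-Agent': BROWSER_UA}\n",
    "    headers.update(extra_headers or {})\n",
    "    return Request(url, headers=headers)\n",
    "\n",
    "\n",
    "# First attempt: User-Agent only\n",
    "basic = build_request('https://xueqiu.com/')\n",
    "\n",
    "# If that is rejected, add more of the pairs seen in the browser\n",
    "fuller = build_request('https://xueqiu.com/',\n",
    "                       {'Referer': 'https://xueqiu.com/today'})\n",
    "\n",
    "# urllib stores header names capitalized as 'User-agent'\n",
    "print(basic.get_header('User-agent'))\n",
    "print(fuller.get_header('Referer'))\n",
    "```\n",
    "\n",
    "Keeping the extra pairs optional makes it easy to find the smallest set of headers a site accepts."
   ]
  },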
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "### Fetch data with a GET request and custom headers\n",
    "\n",
    "import requests        # import the requests package\n",
    "url = 'https://xueqiu.com/'\n",
    "headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'}\n",
    "res = requests.get(url, headers = headers)        # fetch the page with a GET request\n",
    "res.encoding = 'utf-8'  # make sure Chinese text is decoded correctly\n",
    "#print(res.text)\n",
    "res.text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'errno': 998, 'errmsg': '未知错误', 'query': '中国银行', 'from': 'zh', 'to': 'en', 'error': 998}\n"
     ]
    }
   ],
   "source": [
    "#### Send data with a POST request\n",
    "\n",
    "import requests        # import the requests package\n",
    "import json\n",
    "\n",
    "def get_translate(word):\n",
    "    # General Request URL\n",
    "    url = 'https://fanyi.baidu.com/v2transapi?from=zh&to=en'\n",
    "    form_data = {'from':'zh', 'to':'en', 'query':word, 'transtype':'translang', 'simple_means_flag':'3', 'sign':'777849.998728', 'token':'8fdc86c1912abf9a6792ab0df40760c5', 'domain':'common'}\n",
    "    headers = {\n",
    "        'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',\n",
    "        'Cookie':'BIDUPSID=BD1F6031E6E1F58EBA5780A6882450DB; PSTM=1511924987; BDUSS=2F1RlhGZEZGZzJFOE1RZExnazN2bWZTTWc4aWswVWRnOTFKUWh3UjA1MVJZblpiQUFBQUFBJCQAAAAAAAAAAAEAAACIDXkhbmloZTc4AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAFHVTltR1U5bZ; H_WISE_SIDS=139560_141910_100805_142081_142208_142066_135847_141001_138596_140853_141916_142002_137758_138878_137985_141200_140173_131246_137746_138165_107319_138883_140260_141838_140632_139043_140202_140592_136861_138585_141651_140988_141900_140113_140324_140579_133847_131423_140367_140965_136537_141102_110085_141941_127969_140593_131953_139887_140995_138425_138943_141190_141924; BDUSS_BFESS=2F1RlhGZEZGZzJFOE1RZExnazN2bWZTTWc4aWswVWRnOTFKUWh3UjA1MVJZblpiQUFBQUFBJCQAAAAAAAAAAAEAAACIDXkhbmloZTc4AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAFHVTltR1U5bZ; delPer=0; PSINO=3; ZD_ENTRY=bing; BDRCVFR[feWj1Vr5u3D]=I67x6TjHwwYf0; session_name=cn.bing.com; MCITY=-%3A; BAIDUID=32A052A10603308B16C560D6C9D060BE:FG=1; BAIDUID_BFESS=F170A5773764B336959E01B66D608AFB:FG=1; session_id=1604470913738; H_PS_PSSID=1424_33043_32947_33059_31253_32971_32706_32961_32846; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; Hm_lvt_64ecd82404c51e03dc91cb9e8c025574=1605020596; Hm_lpvt_64ecd82404c51e03dc91cb9e8c025574=1605020596; REALTIME_TRANS_SWITCH=1; FANYI_WORD_SWITCH=1; HISTORY_SWITCH=1; SOUND_SPD_SWITCH=1; SOUND_PREFER_SWITCH=1; __yjsv5_shitong=1.0_7_50d6a0d85be4941d2e489f60c1c0ce1e8ce8_300_1605020594794_124.160.64.90_b1c9306c; yjs_js_security_passport=64e43975da70565c859cacfcc9d315009e61e875_1605020596_js'\n",
    "        }\n",
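    "    # Submit the form data; sign, token and Cookie were copied from a browser session and may have expired\n",
    "    response = requests.post(url,data=form_data, headers=headers)\n",
    "    # Convert the JSON string into a dict\n",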
    "    content = json.loads(response.text)\n",
    "    print(content)\n",
    "\n",
    "\n",
    "get_translate('中国银行')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
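    "#### BeautifulSoup\n",
    "\n",
    "BeautifulSoup is a third-party library, so it must be installed before use:\n",
    "\n",
    "pip install bs4\n",
    "\n",
    "pip install lxml\n",
    "\n",
    "- What is bs4?\n",
    "\n",
    "It makes extracting specific content from a web page quick and easy. You give it the page as a string, it turns that string into an object through its API, and you then call methods on the object to extract data.\n",
    "\n",
    "- What is lxml?\n",
    "\n",
    "lxml is a parser, and it is also the library used for XPath later on. bs4 needs a parser when it turns the page string into an object; you can use lxml or html.parser, the parser that ships with Python.\n",
    "\n",
    "**General steps:**\n",
    "1. Fetch the HTML page with the requests library\n",
    "2. Parse the fetched HTML with BeautifulSoup\n",
    "3. Use BeautifulSoup and regular expressions to narrow down to the key information we want\n",
    "4. Format the information and output it\n"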
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sina crawler example\n",
    "\n",
    "# You may need to install:\n",
    "# pip install pymysql\n",
    "# pip install mysqlclient\n",
    "\n",
    "from urllib.parse import urlencode\n",
    "from urllib.request import urlopen,Request\n",
    "from urllib.error import URLError,HTTPError\n",
    "import json\n",
    "import time\n",
    "\n",
    "import pandas as pd  \n",
    "from sqlalchemy import create_engine \n",
    "\n",
    "conn_sql = 'mysql+mysqldb://root:local123@127.0.0.1:3306/newsnews?charset=utf8'\n",
    "# conn = create_engine(conn_sql)  # uncomment once a MySQL server is available; save_to_sql and update_sql need conn\n",
    "\n",
    "def html_download(url):\n",
    "     headers = {\n",
    "            'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/53'\n",
    "            }\n",
    "     request = Request(url,headers = headers)\n",
    "     try:\n",
    "         html = urlopen(request).read().decode()\n",
    "     except HTTPError as e:\n",
    "         html = None\n",
    "         print('Server returned an error: %s'%e.reason)\n",
    "         return None\n",
    "     except URLError as e:\n",
    "         html = None\n",
    "         print('Failed to reach the page: %s'%e.reason)\n",
    "         return None\n",
    "     return html\n",
    " \n",
    "def json2df(json_results):\n",
    "    res = pd.DataFrame.from_records(json_results)\n",
    "    tags = []\n",
    "    for r in res.iterrows():\n",
    "        try:\n",
    "            tags.append(r[-1]['tag'][0]['name'])\n",
    "        except (KeyError, IndexError, TypeError):\n",
    "            tags.append('其他')  # fall back to 'other' when a row has no usable tag\n",
    "    x = res.loc[:,['id','commentid','creator','rich_text','update_time','zhibo_id']]\n",
    "    x['tag'] = tags\n",
    "    return x\n",
    " \n",
    "def api_info_manager(page, zhibo_id = 152):\n",
    "    #http://zhibo.sina.com.cn/api/zhibo/feed?&page=1&page_size=100&zhibo_id=152\n",
    "    data = {\n",
    "            'page':page,\n",
    "            'page_size':100,\n",
    "            'zhibo_id':zhibo_id\n",
    "            }\n",
    "    dataformat = 'http://zhibo.sina.com.cn/api/zhibo/feed?' + urlencode(data)\n",
    "    response = html_download(dataformat)\n",
    "    return json.loads(response)['result']['data']['feed']['list']\n",
    "    #json_results = json.dumps(json_results,ensure_ascii = False)\n",
    "    #print(json_results)\n",
    "\n",
    "        \n",
    "def save_to_sql(res):\n",
    "    try:\n",
    "        r = res.sort_values(by='id', ascending = True)\n",
    "        # Requires a MySQL database named newsnews (see conn_sql and schema below)\n",
    "        pd.io.sql.to_sql(r,'sina_fin_news2', con=conn, schema='newsnews', if_exists = 'append')\n",
    "        print('Successful!')\n",
    "    except Exception:\n",
    "        print('Fail')\n",
    "\n",
    "def update_sql(res):\n",
    "    try:\n",
    "        last_id = int(pd.read_sql_query('select id from sina_fin_news ORDER BY id desc LIMIT 1', conn).id)\n",
    "        in_list = res[res['id']>last_id]\n",
    "        new_l = in_list.sort_values(by='id', ascending = True)\n",
    "        pd.io.sql.to_sql(new_l,'sina_fin_news', con=conn, schema='news', if_exists = 'append')\n",
    "       \n",
    "        print('{} items updated successfully on {}'.format(len(new_l), time.strftime('%Y-%m-%d %H:%M:%S')))\n",
    "    except Exception:\n",
    "        \n",
    "        print('Fail to update on {}'.format(time.strftime('%Y-%m-%d %H:%M:%S')))\n",
    "    \n",
    "def main(page):\n",
    "    json_res = api_info_manager(page)\n",
    "    res = json2df(json_res)\n",
    "    save_to_sql(res)\n",
    "\n",
    "def updating():\n",
    "    while True:\n",
    "        json_res = api_info_manager(1)\n",
    "        res = json2df(json_res)\n",
    "        update_sql(res)\n",
    "        time.sleep(3600)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "data = {\n",
    "        'page':1,\n",
    "        'page_size':100,\n",
    "        'zhibo_id':152\n",
    "        }\n",
    "dataformat = 'http://zhibo.sina.com.cn/api/zhibo/feed?' + urlencode(data)\n",
    "response = html_download(dataformat)\n",
    "doc = json.loads(response)['result']['data']['feed']['list']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "http://zhibo.sina.com.cn/api/zhibo/feed?page=1&page_size=100&zhibo_id=152\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "['欧洲央行行长拉加德：2022年第三季度，欧元区GDP增速放缓至0.2%。',\n",
       " '【中国恒大：出售深圳超级总部地块将录得亏损约1.63亿元】中国恒大在港交所公告，2022年11月26日，公司附属公司恒大集团有限公司（转让方）与深圳市安和一号房地产开发有限公司（受让方）签订协议，据此转让方将其持有的一块位于深圳市的商业地转让予受让方，总代价为人民币75.43亿元。公告称，预期集团就出售事项将会录得亏损约人民币1.63亿元。该土地将按现状转让。该土地位于深圳市深湾三路与白石四道交汇处东南角，土地面积10,376.82平方米，用于商业服务发展，可建总面积不超过289,200平方米。据此前报道，该宗地位于深圳湾超级总部基地，恒大在2017年12月拿下这幅块地，当初是以底价55.52亿元成交，原本计划建成恒大超级总部并于2024年竣工。',\n",
       " '【会稽山：控股股东拟变更为中建信浙江公司】会稽山公告，根据重整计划，中建信将支付18.73亿元取得精功集团持有的公司1.49亿股股份（占公司总股本的29.99%），该等股份将全部转入中建信之全资子公司中建信浙江公司；精功集团持有公司的剩余1484.18万股股份（占公司总股本的2.98%）将置入服务信托1号。本次权益变动前，公司控股股东为精功集团，公司实际控制人为金良顺；本次权益变动后，中建信浙江公司拟将成为公司的控股股东，方朝阳作为中建信浙江公司的实际控制人，也拟将成为公司的实际控制人。',\n",
       " '  【上海东方明珠电视塔明日起暂时关闭 恢复开放时间另行通知】上海东方明珠广播电视塔将于2022年11月29日（周二）起暂时关闭，即刻生效。恢复开放时间另行通知。',\n",
       " '【ST爱迪尔：北京达天成科技有限公司参与公司重整】ST爱迪尔公告，公司与重整意向投资方签署了《预重整投资（意向）协议之补充协议》，原意向产业投资人之一深圳市赢盛数字科技有限公司退出意向投资人联合体，北京达天成科技有限公司作为新的意向产业投资人，加入意向投资人联合体，共同参与公司重整。',\n",
       " '阿联酋阿布扎比国家石油公司ADNOC：将建立低碳解决方案以及垂直于新能源、天然气、液化天然气和化学品的国际增长领域。',\n",
       " '市场消息：中国恒大将以75亿元人民币出售一块位于深圳的商业地，录得约1.63亿元人民币的亏损。受让方深圳市安和一号房地产开发有限公司是在中国成立的有限责任公司，主要从事房地产建设业务。',\n",
       " '欧洲央行行长拉加德：利率是且并将继续是对抗通胀的主要工具。',\n",
       " '欧洲央行行长拉加德：利率需要进一步提高多少，以及以多快的速度提高，将取决于我们的最新展望、冲击的持续性、工资和通胀预期的反应，以及我们对传导的评估。',\n",
       " '【精功科技：控股股东重整计划获得法院裁定批准 控股股东、实际控制人拟发生变更】精功科技公告，控股股东精功集团有限公司管理人收到绍兴市柯桥区人民法院送达的《民事裁定书》，裁定批准《精功集团等九公司重整计划》，并终止精功集团等九公司重整程序。同日公告，根据《重整计划》，中建信控股集团有限公司将支付投资对价为11.84亿元取得精功集团持有公司股份中的1.37亿股股份（占公司总股本的29.99%），该等股份将全部转入中建信之全资子公司中建信（浙江）创业投资有限公司，精功集团持有精功科技的剩余530.74万股股份将置入浙金·精功集团有限公司等九公司破产重整服务信托1号。本次权益变动后，中建信浙江公司拟将成为公司的控股股东，方朝阳拟将成为公司的实际控制人。']"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print(dataformat)\n",
    "[d['rich_text'] for d in doc][:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    【导弹爆炸事件后，德国称拟向波兰提供“爱国者”防空系统】据路透社20日报道，德国国防部长兰布...\n",
       "1                              香山公园：双清别墅22日起暂停开放。（新京报）\n",
       "2      【教育部：及时发现纠正高校防疫不当做法 及时协调解决师生合理诉求】教育部党组书记、部长怀...\n",
       "3      捷克央行副行长马雷克莫拉周一表示，他赞成将利率作为对抗高通胀的主要货币政策工具，而不是支...\n",
       "4        夜盘开盘，沥青主力合约跌超2%，沪锡主力合约跌近2%；涨幅方面，纯碱主力合约涨0.65%。\n",
       "5    【英飞特：车载充电机产品已在多家主机厂进行测试、导入】英飞特在披露的投资者关系活动记录表中表...\n",
       "6      【日媒：内阁大臣接连辞职可能促使日本首相改组内阁】据日本共同社11月21日报道，近期日本...\n",
       "7      【世界杯期间伊朗每天向卡塔尔出口250吨食品和农产品】当地时间20日，伊朗贸易促进组织（...\n",
       "8      Imago BioSciences美股盘前持续走高，现涨超104%，消息称默沙东将收购该公司。\n",
       "9    【360数科孖展暂录5447万港元 国际及本地均获足额认购】营信贷科技平台的中概股360数科...\n",
       "Name: rich_text, dtype: object"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "json_res = api_info_manager(1)\n",
    "res = pd.DataFrame.from_records(json_res)\n",
    "res.iloc[:10,3]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>commentid</th>\n",
       "      <th>creator</th>\n",
       "      <th>rich_text</th>\n",
       "      <th>update_time</th>\n",
       "      <th>zhibo_id</th>\n",
       "      <th>tag</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2874039</td>\n",
       "      <td>live:finance-152-2874039:0</td>\n",
       "      <td>yongsheng6@staff.sina.com.cn</td>\n",
       "      <td>【导弹爆炸事件后，德国称拟向波兰提供“爱国者”防空系统】据路透社20日报道，德国国防部长兰布...</td>\n",
       "      <td>2022-11-21 21:04:20</td>\n",
       "      <td>152</td>\n",
       "      <td>国际</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2874038</td>\n",
       "      <td>live:finance-152-2874038:0</td>\n",
       "      <td>yongsheng6@staff.sina.com.cn</td>\n",
       "      <td>香山公园：双清别墅22日起暂停开放。（新京报）</td>\n",
       "      <td>2022-11-21 21:03:27</td>\n",
       "      <td>152</td>\n",
       "      <td>国际</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2874037</td>\n",
       "      <td>live:finance-152-2874037:0</td>\n",
       "      <td>yongsheng6@staff.sina.com.cn</td>\n",
       "      <td>【教育部：及时发现纠正高校防疫不当做法 及时协调解决师生合理诉求】教育部党组书记、部长怀...</td>\n",
       "      <td>2022-11-21 21:03:03</td>\n",
       "      <td>152</td>\n",
       "      <td>国际</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2874036</td>\n",
       "      <td>live:finance-152-2874036:0</td>\n",
       "      <td>yongsheng6@staff.sina.com.cn</td>\n",
       "      <td>捷克央行副行长马雷克莫拉周一表示，他赞成将利率作为对抗高通胀的主要货币政策工具，而不是支...</td>\n",
       "      <td>2022-11-21 21:02:27</td>\n",
       "      <td>152</td>\n",
       "      <td>国际</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2874035</td>\n",
       "      <td></td>\n",
       "      <td>yongsheng6@staff.sina.com.cn</td>\n",
       "      <td>夜盘开盘，沥青主力合约跌超2%，沪锡主力合约跌近2%；涨幅方面，纯碱主力合约涨0.65%。</td>\n",
       "      <td>2022-11-21 21:00:23</td>\n",
       "      <td>152</td>\n",
       "      <td>国际</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>95</th>\n",
       "      <td>2873944</td>\n",
       "      <td>live:finance-152-2873944:0</td>\n",
       "      <td>yongsheng6@staff.sina.com.cn</td>\n",
       "      <td>【保险公司开展个人养老金业务规定正式下发】从业内获悉，银保监会下发《关于保险公司开展个人养老...</td>\n",
       "      <td>2022-11-21 18:46:47</td>\n",
       "      <td>152</td>\n",
       "      <td>国际</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>96</th>\n",
       "      <td>2873943</td>\n",
       "      <td>live:finance-152-2873943:0</td>\n",
       "      <td>yongsheng6@staff.sina.com.cn</td>\n",
       "      <td>【万科A：拟新增不超500亿元发行直接债务融资工具授权】万科A公告，董事会同意向股东大会申请...</td>\n",
       "      <td>2022-11-21 18:38:57</td>\n",
       "      <td>152</td>\n",
       "      <td>国际</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>97</th>\n",
       "      <td>2873942</td>\n",
       "      <td></td>\n",
       "      <td>mingyu@staff.sina.com.cn</td>\n",
       "      <td>【斯里兰卡10月全国消费者价格通胀放缓至70.6%】斯里兰卡国际统计部门周一表示，斯里兰卡国...</td>\n",
       "      <td>2022-11-21 18:30:05</td>\n",
       "      <td>152</td>\n",
       "      <td>国际</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>98</th>\n",
       "      <td>2873941</td>\n",
       "      <td>live:finance-152-2873941:0</td>\n",
       "      <td>mingyu@staff.sina.com.cn</td>\n",
       "      <td>惠誉：预计2023年更多小型新兴和前沿市场将失去市场准入，面临紧急融资挑战，可能出现更多违约。</td>\n",
       "      <td>2022-11-21 18:29:40</td>\n",
       "      <td>152</td>\n",
       "      <td>国际</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>99</th>\n",
       "      <td>2873940</td>\n",
       "      <td>live:finance-152-2873940:0</td>\n",
       "      <td>mingyu@staff.sina.com.cn</td>\n",
       "      <td>惠誉：预计2023年主权债务成本的突然飙升可能会更加频繁。</td>\n",
       "      <td>2022-11-21 18:29:35</td>\n",
       "      <td>152</td>\n",
       "      <td>国际</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>100 rows × 7 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "         id                   commentid                       creator  \\\n",
       "0   2874039  live:finance-152-2874039:0  yongsheng6@staff.sina.com.cn   \n",
       "1   2874038  live:finance-152-2874038:0  yongsheng6@staff.sina.com.cn   \n",
       "2   2874037  live:finance-152-2874037:0  yongsheng6@staff.sina.com.cn   \n",
       "3   2874036  live:finance-152-2874036:0  yongsheng6@staff.sina.com.cn   \n",
       "4   2874035                              yongsheng6@staff.sina.com.cn   \n",
       "..      ...                         ...                           ...   \n",
       "95  2873944  live:finance-152-2873944:0  yongsheng6@staff.sina.com.cn   \n",
       "96  2873943  live:finance-152-2873943:0  yongsheng6@staff.sina.com.cn   \n",
       "97  2873942                                  mingyu@staff.sina.com.cn   \n",
       "98  2873941  live:finance-152-2873941:0      mingyu@staff.sina.com.cn   \n",
       "99  2873940  live:finance-152-2873940:0      mingyu@staff.sina.com.cn   \n",
       "\n",
       "                                            rich_text          update_time  \\\n",
       "0   【导弹爆炸事件后，德国称拟向波兰提供“爱国者”防空系统】据路透社20日报道，德国国防部长兰布...  2022-11-21 21:04:20   \n",
       "1                             香山公园：双清别墅22日起暂停开放。（新京报）  2022-11-21 21:03:27   \n",
       "2     【教育部：及时发现纠正高校防疫不当做法 及时协调解决师生合理诉求】教育部党组书记、部长怀...  2022-11-21 21:03:03   \n",
       "3     捷克央行副行长马雷克莫拉周一表示，他赞成将利率作为对抗高通胀的主要货币政策工具，而不是支...  2022-11-21 21:02:27   \n",
       "4       夜盘开盘，沥青主力合约跌超2%，沪锡主力合约跌近2%；涨幅方面，纯碱主力合约涨0.65%。  2022-11-21 21:00:23   \n",
       "..                                                ...                  ...   \n",
       "95  【保险公司开展个人养老金业务规定正式下发】从业内获悉，银保监会下发《关于保险公司开展个人养老...  2022-11-21 18:46:47   \n",
       "96  【万科A：拟新增不超500亿元发行直接债务融资工具授权】万科A公告，董事会同意向股东大会申请...  2022-11-21 18:38:57   \n",
       "97  【斯里兰卡10月全国消费者价格通胀放缓至70.6%】斯里兰卡国际统计部门周一表示，斯里兰卡国...  2022-11-21 18:30:05   \n",
       "98    惠誉：预计2023年更多小型新兴和前沿市场将失去市场准入，面临紧急融资挑战，可能出现更多违约。  2022-11-21 18:29:40   \n",
       "99                      惠誉：预计2023年主权债务成本的突然飙升可能会更加频繁。  2022-11-21 18:29:35   \n",
       "\n",
       "    zhibo_id tag  \n",
       "0        152  国际  \n",
       "1        152  国际  \n",
       "2        152  国际  \n",
       "3        152  国际  \n",
       "4        152  国际  \n",
       "..       ...  ..  \n",
       "95       152  国际  \n",
       "96       152  国际  \n",
       "97       152  国际  \n",
       "98       152  国际  \n",
       "99       152  国际  \n",
       "\n",
       "[100 rows x 7 columns]"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x = res.loc[:,['id','commentid','creator','rich_text','update_time','zhibo_id']]\n",
    "tags = res.loc[0,:]['tag'][0]['name']\n",
    "x['tag'] = tags\n",
    "x"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "### Xueqiu crawler example\n",
    "\n",
    "from urllib.parse import urlencode\n",
    "import requests\n",
    "from requests.exceptions import RequestException\n",
    "import json\n",
    "import time\n",
    "import datetime\n",
    "\n",
    "\n",
    "import pandas as pd  \n",
    "from sqlalchemy import create_engine \n",
    "\n",
    "\n",
    "conn_sql = 'mysql+mysqldb://root:local123@127.0.0.1:3306/news?charset=utf8'\n",
    "#conn = create_engine(conn_sql)  \n",
    "\n",
    "\n",
    "\n",
    "def getcookies():  # fetch the cookies set by xueqiu.com\n",
    "    headers3 = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0',\n",
    "           'Referer': 'https://xueqiu.com/today',\n",
    "           'Host': 'xueqiu.com',\n",
    "           }\n",
    "    r = requests.get(url = 'https://xueqiu.com/', headers=headers3)\n",
    "    if r.status_code == 200:\n",
    "        cookie = r.cookies.get_dict()\n",
    "        return cookie\n",
    "    return None\n",
    "\n",
    "def html_download(url, cookie):\n",
    "    headers3 = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0',\n",
    "               'referer': 'https://xueqiu.com/',\n",
    "               'Host': 'xueqiu.com',\n",
    "               }\n",
    "    try:\n",
    "        request = requests.get(url, headers=headers3,cookies=cookie)\n",
    "        if request.status_code == 200:\n",
    "            return request.text\n",
    "        return None\n",
    "    except RequestException:\n",
    "        return None\n",
    "\n",
    "\n",
    "def api_info_manager(cookie):\n",
    "    data = {\n",
    "            'since_id': -1,\n",
    "            'max_id': -1,\n",
    "            'count': 20,\n",
    "            'category': 6\n",
    "            }\n",
    "    dataformat = 'https://xueqiu.com/v4/statuses/public_timeline_by_category.json?' + urlencode(data)\n",
    "    response = html_download(dataformat,cookie)\n",
    "    if not response: \n",
    "        # if not works, do it again\n",
    "        cookie = getcookies()\n",
    "        response = html_download(dataformat,cookie)\n",
    "    \n",
    "    x = pd.DataFrame(columns =['id','category','text','target','view_count','created_at'])\n",
    "    for i,info in enumerate(json.loads(response)['list']):\n",
    "        item = json.loads(info['data'])  # the 'data' field is itself a JSON string; decode it once\n",
    "        x.loc[i,'id'] = info['id']\n",
    "        x.loc[i,'category'] = info['category']\n",
    "        x.loc[i,'text'] = item['text']\n",
    "        x.loc[i,'target'] = item['target']\n",
    "        x.loc[i,'view_count'] = item['view_count']\n",
    "        x.loc[i,'created_at'] = datetime.datetime.fromtimestamp(int(item['created_at']/1000)).strftime('%Y-%m-%d %H:%M:%S')\n",
    "    return x        \n",
    "     \n",
    "\n",
    "def update_sql(res):\n",
    "    try:\n",
    "        last_id = int(pd.read_sql_query('select id from xueqiu_fin_news ORDER BY id desc LIMIT 1', conn).id)\n",
    "        in_list = res[res['id']>last_id]\n",
    "        new_l = in_list.sort_values(by='id', ascending = True)\n",
    "        pd.io.sql.to_sql(new_l,'xueqiu_fin_news', con=conn, schema='news_online', if_exists = 'append')\n",
    "        print('{} items updated successfully on {}'.format(len(new_l), time.strftime('%Y-%m-%d %H:%M:%S')))\n",
    "    except Exception:\n",
    "        print('Fail to update on {}'.format(time.strftime('%Y-%m-%d %H:%M:%S')))\n",
    "    \n",
    "def main(page):\n",
    "    cookie = getcookies()\n",
    "    res = api_info_manager(cookie)\n",
    "    update_sql(res)\n",
    "\n",
    "def updating(cookie):  \n",
    "    while True:\n",
    "        res = api_info_manager(cookie)\n",
    "        update_sql(res)\n",
    "        time.sleep(3600)\n",
    "\n",
    "#if __name__ == '__main__':\n",
    "    #cookie = getcookies()\n",
    "    #updating(cookie)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>category</th>\n",
       "      <th>text</th>\n",
       "      <th>target</th>\n",
       "      <th>view_count</th>\n",
       "      <th>created_at</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2677752</td>\n",
       "      <td>6</td>\n",
       "      <td>【会稽山：控股股东拟变更为中建信浙江公司】会稽山公告，根据重整计划，中建信将支付18.73亿...</td>\n",
       "      <td>http://xueqiu.com/5124430882/236511965</td>\n",
       "      <td>5905</td>\n",
       "      <td>2022-11-28 22:16:18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2677750</td>\n",
       "      <td>6</td>\n",
       "      <td>【上海东方明珠电视塔明日起暂时关闭 恢复开放时间另行通知】上海东方明珠广播电视塔将于2022...</td>\n",
       "      <td>http://xueqiu.com/5124430882/236511806</td>\n",
       "      <td>8719</td>\n",
       "      <td>2022-11-28 22:13:13</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2677749</td>\n",
       "      <td>6</td>\n",
       "      <td>【ST爱迪尔：北京达天成科技有限公司参与公司重整】ST爱迪尔公告，公司与重整意向投资方签署了...</td>\n",
       "      <td>http://xueqiu.com/5124430882/236511643</td>\n",
       "      <td>9131</td>\n",
       "      <td>2022-11-28 22:10:54</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2677736</td>\n",
       "      <td>6</td>\n",
       "      <td>【中石油在建最大规模石脑油加氢装置试投产成功】 28日14时30分，位于广东省揭阳市的广东石...</td>\n",
       "      <td>http://xueqiu.com/5124430882/236510799</td>\n",
       "      <td>17404</td>\n",
       "      <td>2022-11-28 21:57:18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2677732</td>\n",
       "      <td>6</td>\n",
       "      <td>美元兑俄罗斯卢布USD/RUB扩大涨幅，自11月22日以来首次触及61。</td>\n",
       "      <td>http://xueqiu.com/5124430882/236510357</td>\n",
       "      <td>20945</td>\n",
       "      <td>2022-11-28 21:50:25</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2677731</td>\n",
       "      <td>6</td>\n",
       "      <td>【世界卫生组织发表声明，建议重新命名猴痘】当地时间11月28日，世卫组织发表声明，建议在英语...</td>\n",
       "      <td>http://xueqiu.com/5124430882/236510334</td>\n",
       "      <td>20980</td>\n",
       "      <td>2022-11-28 21:50:10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>2677722</td>\n",
       "      <td>6</td>\n",
       "      <td>【中联重科：G系列国四新品挖掘机发布会11月30日举行】据中联重科消息，11月30日晚，G系...</td>\n",
       "      <td>http://xueqiu.com/5124430882/236509253</td>\n",
       "      <td>25809</td>\n",
       "      <td>2022-11-28 21:33:57</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>2677720</td>\n",
       "      <td>6</td>\n",
       "      <td>加拿大第三季度经常帐逆差111亿加元，预期逆差40亿加元，前值顺差26.9亿加元。</td>\n",
       "      <td>http://xueqiu.com/5124430882/236509126</td>\n",
       "      <td>25721</td>\n",
       "      <td>2022-11-28 21:32:07</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>2677717</td>\n",
       "      <td>6</td>\n",
       "      <td>【拼多多业绩会：账面利润临时增加不可持续】拼多多发布2022年第三季度财报，三季度收入为35...</td>\n",
       "      <td>http://xueqiu.com/5124430882/236509065</td>\n",
       "      <td>30296</td>\n",
       "      <td>2022-11-28 21:30:54</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>2677713</td>\n",
       "      <td>6</td>\n",
       "      <td>【宁夏银川：开展购房促销活动，首套房最低首付比例降至20%】即日起银川市开展为期半个月的购房...</td>\n",
       "      <td>http://xueqiu.com/5124430882/236508553</td>\n",
       "      <td>28483</td>\n",
       "      <td>2022-11-28 21:23:41</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>2677712</td>\n",
       "      <td>6</td>\n",
       "      <td>特斯拉将于2023年第三季度开始研发改进型Model 3，代号为“HIGHLAND”的新Mo...</td>\n",
       "      <td>http://xueqiu.com/5124430882/236507951</td>\n",
       "      <td>31486</td>\n",
       "      <td>2022-11-28 21:15:10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>2677707</td>\n",
       "      <td>6</td>\n",
       "      <td>【恒丰银行与多家房地产企业达成签约意向】从恒丰银行获悉，近日，恒丰银行与山东银丰集团、青岛青...</td>\n",
       "      <td>http://xueqiu.com/5124430882/236507484</td>\n",
       "      <td>31096</td>\n",
       "      <td>2022-11-28 21:08:18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>2677704</td>\n",
       "      <td>6</td>\n",
       "      <td>【蓝光发展：控股股东所持1.7亿股公司股票将被公开拍卖】蓝光发展11月28日晚公告称，公司接...</td>\n",
       "      <td>http://xueqiu.com/5124430882/236507356</td>\n",
       "      <td>30665</td>\n",
       "      <td>2022-11-28 21:06:29</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>2677700</td>\n",
       "      <td>6</td>\n",
       "      <td>标普全球市场情报：瑞士信贷5年期信贷违约掉期（CDS）升至398个基点，为历史新高。</td>\n",
       "      <td>http://xueqiu.com/5124430882/236507218</td>\n",
       "      <td>29226</td>\n",
       "      <td>2022-11-28 21:04:35</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>2677699</td>\n",
       "      <td>6</td>\n",
       "      <td>【罗尔斯·罗伊斯与易捷航空成功测试氢燃料飞机发动机，开创全球航空业先河】罗尔斯·罗伊斯11月...</td>\n",
       "      <td>http://xueqiu.com/5124430882/236507207</td>\n",
       "      <td>28078</td>\n",
       "      <td>2022-11-28 21:04:18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>2677696</td>\n",
       "      <td>6</td>\n",
       "      <td>【隆基增资通威永祥二期20万吨硅料项目】通威公告宣布，通威旗下全资子公司四川永祥与隆基绿能拟...</td>\n",
       "      <td>http://xueqiu.com/5124430882/236507173</td>\n",
       "      <td>32379</td>\n",
       "      <td>2022-11-28 21:03:54</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>2677694</td>\n",
       "      <td>6</td>\n",
       "      <td>【10月底证券期货经营机构私募资管业务规模突破16万亿元】11月28日，中国证券投资基金业协...</td>\n",
       "      <td>http://xueqiu.com/5124430882/236507098</td>\n",
       "      <td>25356</td>\n",
       "      <td>2022-11-28 21:02:51</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>2677693</td>\n",
       "      <td>6</td>\n",
       "      <td>国联股份：前期有媒体发布《国联股份的惊天谎言？客商复杂交织背后“隐现”融资性贸易网》等报道，...</td>\n",
       "      <td>http://xueqiu.com/5124430882/236507075</td>\n",
       "      <td>33931</td>\n",
       "      <td>2022-11-28 21:02:24</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>2677689</td>\n",
       "      <td>6</td>\n",
       "      <td>【美国夏威夷莫纳罗亚火山开始喷发 警戒级别升级】美国地质调查局火山活动部门表示，当地时间11...</td>\n",
       "      <td>http://xueqiu.com/5124430882/236506966</td>\n",
       "      <td>24127</td>\n",
       "      <td>2022-11-28 21:00:36</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>2677688</td>\n",
       "      <td>6</td>\n",
       "      <td>夜盘期货开盘，原油跌近3%，燃油、低硫燃油、PVC、沥青、甲醇跌逾1%；沪锡、铁矿石、焦炭、...</td>\n",
       "      <td>http://xueqiu.com/5124430882/236506961</td>\n",
       "      <td>22207</td>\n",
       "      <td>2022-11-28 21:00:29</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         id category                                               text  \\\n",
       "0   2677752        6  【会稽山：控股股东拟变更为中建信浙江公司】会稽山公告，根据重整计划，中建信将支付18.73亿...   \n",
       "1   2677750        6  【上海东方明珠电视塔明日起暂时关闭 恢复开放时间另行通知】上海东方明珠广播电视塔将于2022...   \n",
       "2   2677749        6  【ST爱迪尔：北京达天成科技有限公司参与公司重整】ST爱迪尔公告，公司与重整意向投资方签署了...   \n",
       "3   2677736        6  【中石油在建最大规模石脑油加氢装置试投产成功】 28日14时30分，位于广东省揭阳市的广东石...   \n",
       "4   2677732        6               美元兑俄罗斯卢布USD/RUB扩大涨幅，自11月22日以来首次触及61。   \n",
       "5   2677731        6  【世界卫生组织发表声明，建议重新命名猴痘】当地时间11月28日，世卫组织发表声明，建议在英语...   \n",
       "6   2677722        6  【中联重科：G系列国四新品挖掘机发布会11月30日举行】据中联重科消息，11月30日晚，G系...   \n",
       "7   2677720        6          加拿大第三季度经常帐逆差111亿加元，预期逆差40亿加元，前值顺差26.9亿加元。   \n",
       "8   2677717        6  【拼多多业绩会：账面利润临时增加不可持续】拼多多发布2022年第三季度财报，三季度收入为35...   \n",
       "9   2677713        6  【宁夏银川：开展购房促销活动，首套房最低首付比例降至20%】即日起银川市开展为期半个月的购房...   \n",
       "10  2677712        6  特斯拉将于2023年第三季度开始研发改进型Model 3，代号为“HIGHLAND”的新Mo...   \n",
       "11  2677707        6  【恒丰银行与多家房地产企业达成签约意向】从恒丰银行获悉，近日，恒丰银行与山东银丰集团、青岛青...   \n",
       "12  2677704        6  【蓝光发展：控股股东所持1.7亿股公司股票将被公开拍卖】蓝光发展11月28日晚公告称，公司接...   \n",
       "13  2677700        6         标普全球市场情报：瑞士信贷5年期信贷违约掉期（CDS）升至398个基点，为历史新高。   \n",
       "14  2677699        6  【罗尔斯·罗伊斯与易捷航空成功测试氢燃料飞机发动机，开创全球航空业先河】罗尔斯·罗伊斯11月...   \n",
       "15  2677696        6  【隆基增资通威永祥二期20万吨硅料项目】通威公告宣布，通威旗下全资子公司四川永祥与隆基绿能拟...   \n",
       "16  2677694        6  【10月底证券期货经营机构私募资管业务规模突破16万亿元】11月28日，中国证券投资基金业协...   \n",
       "17  2677693        6  国联股份：前期有媒体发布《国联股份的惊天谎言？客商复杂交织背后“隐现”融资性贸易网》等报道，...   \n",
       "18  2677689        6  【美国夏威夷莫纳罗亚火山开始喷发 警戒级别升级】美国地质调查局火山活动部门表示，当地时间11...   \n",
       "19  2677688        6  夜盘期货开盘，原油跌近3%，燃油、低硫燃油、PVC、沥青、甲醇跌逾1%；沪锡、铁矿石、焦炭、...   \n",
       "\n",
       "                                    target view_count           created_at  \n",
       "0   http://xueqiu.com/5124430882/236511965       5905  2022-11-28 22:16:18  \n",
       "1   http://xueqiu.com/5124430882/236511806       8719  2022-11-28 22:13:13  \n",
       "2   http://xueqiu.com/5124430882/236511643       9131  2022-11-28 22:10:54  \n",
       "3   http://xueqiu.com/5124430882/236510799      17404  2022-11-28 21:57:18  \n",
       "4   http://xueqiu.com/5124430882/236510357      20945  2022-11-28 21:50:25  \n",
       "5   http://xueqiu.com/5124430882/236510334      20980  2022-11-28 21:50:10  \n",
       "6   http://xueqiu.com/5124430882/236509253      25809  2022-11-28 21:33:57  \n",
       "7   http://xueqiu.com/5124430882/236509126      25721  2022-11-28 21:32:07  \n",
       "8   http://xueqiu.com/5124430882/236509065      30296  2022-11-28 21:30:54  \n",
       "9   http://xueqiu.com/5124430882/236508553      28483  2022-11-28 21:23:41  \n",
       "10  http://xueqiu.com/5124430882/236507951      31486  2022-11-28 21:15:10  \n",
       "11  http://xueqiu.com/5124430882/236507484      31096  2022-11-28 21:08:18  \n",
       "12  http://xueqiu.com/5124430882/236507356      30665  2022-11-28 21:06:29  \n",
       "13  http://xueqiu.com/5124430882/236507218      29226  2022-11-28 21:04:35  \n",
       "14  http://xueqiu.com/5124430882/236507207      28078  2022-11-28 21:04:18  \n",
       "15  http://xueqiu.com/5124430882/236507173      32379  2022-11-28 21:03:54  \n",
       "16  http://xueqiu.com/5124430882/236507098      25356  2022-11-28 21:02:51  \n",
       "17  http://xueqiu.com/5124430882/236507075      33931  2022-11-28 21:02:24  \n",
       "18  http://xueqiu.com/5124430882/236506966      24127  2022-11-28 21:00:36  \n",
       "19  http://xueqiu.com/5124430882/236506961      22207  2022-11-28 21:00:29  "
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cookie = getcookies()\n",
    "res = api_info_manager(cookie)\n",
    "res"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'u': '711669645519758',\n",
       " 'xq_a_token': 'df4b782b118f7f9cabab6989b39a24cb04685f95',\n",
       " 'xq_id_token': 'eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJ1aWQiOi0xLCJpc3MiOiJ1YyIsImV4cCI6MTY3MjE4Njc1MSwiY3RtIjoxNjY5NjQ1NDgwODUxLCJjaWQiOiJkOWQwbjRBWnVwIn0.nV9kA5Y2lQcrsv0iSbTx6zMsFLJonBgNbv0uZ49ngAECXtBmIUKx6GYbZy3yDwy0_vnCEp7MZZj_cf49PjXDVg84nGbEbpx1KGbxDoFcPBCJYvvk3fp0dTrDF8aGeZBtM0W1h2c8ixNUDO07IzgMB2C3ImzunM4arr_IeV42ESHWDrBes3d1UAp4Di2icKCrGTYvsZ0AtuEUOTaJKj8B0kUSTMaGP24ZcQ-u2CySmeHXXhj31p61a74xTHWBZKnmZRuSgZIVRpG7OF7KMxRtxUdX0BYZ4KiEq3PX1fvbrBuR0oIKw535MHppUyghaxv7xW2Rx6CmJRTLOdEATxot6w',\n",
       " 'xq_r_token': '3ae1ada2a33de0f698daa53fb4e1b61edf335952',\n",
       " 'xqat': 'df4b782b118f7f9cabab6989b39a24cb04685f95',\n",
       " 'acw_tc': '2760826116696455197513727e8bef32455229c5a43b30dc28511ce00e7597'}"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cookie"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "url = 'https://finance.sina.com.cn/roll/index.d.html?cid=56588&page=1'\n",
    "headers = {\n",
    "            'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/53'\n",
    "            }\n",
    "res = requests.get(url, headers=headers)\n",
    "res.encoding = 'utf-8'\n",
    "res.text"
   ]
  },
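  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Eastmoney (东方财富) research reports\n",
    "\n",
    "First, open the Eastmoney stock research report page in Chrome: http://data.eastmoney.com/report/stock.jshtml\n",
    "\n",
    "The address http://data.eastmoney.com/report/stock.jshtml contains no parameter saying that the first page of data is being requested, and clicking \"next page\" leaves the address unchanged. We can therefore conclude that Eastmoney loads the data asynchronously. In other words, this address is not the URL that actually serves the report data, so we need to inspect the network traffic to find the URL that does.\n",
    "\n",
    "![](assets/dongfang_js.png)\n",
    "\n",
    "Note: open the developer tools (Inspect), go to Network -> JS, and select the request whose name starts with `list?`.\n",
    "\n",
    "The response is structured much like a dictionary, so we should be able to read the data as JSON.\n",
    "\n",
    "Steps:\n",
    " - Build the URL\n",
    " - Read the JSON data\n",
    " - Store the data in the database"
   ]
  },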
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Building the URL\n",
    "\n",
    "Because the data is spread over several pages, we look at how the URL is put together:\n",
    "\n",
    "https://reportapi.eastmoney.com/report/list?cb=datatable1084556&industryCode=*&pageSize=50&industry=*&rating=&ratingChange=&beginTime=2019-12-04&endTime=2021-12-04&pageNo=2&fields=&qType=0&orgCode=&code=*&rcode=&p=2&pageNum=2&pageNumber=2&_=1638549221083\n",
    "\n",
    "\n",
    "Trying a few different pages shows which parts change:\n",
    "\n",
    "![](assets/url.png)\n",
    "\n",
    "The parts in the red boxes clearly carry meaning. In the blue boxes, the first number appears to make little difference, while the second is a timestamp (it increases monotonically).\n",
    "\n",
    "We therefore build the URLs as follows (`range(1,10)` covers pages 1 through 9):\n",
    "\n",
    ">urls = ('http://reportapi.eastmoney.com/report/list?cb=datatable2678479&industryCode=*&pageSize=50&industry=*&\\\n",
    "    rating=&ratingChange=&beginTime=2019-11-12&endTime=2021-11-12&pageNo=%d&fields=&qType=0&orgCode=&code=*&rcode=&p=%d&\\\n",
    "        pageNum=%d&pageNumber=%d&_=%d'%(i,i,i,i,int(round(time.time() * 1000))) for i in range(1,10) )"
   ]
  },
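  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an alternative to formatting one long string by hand, the same query can be assembled from a dictionary with `urllib.parse.urlencode`. This is only a sketch: the parameter names and values are copied from the URL above, while `BASE_URL` and `build_report_url` are names introduced here for illustration.\n",
    "\n",
    "```python\n",
    "import time\n",
    "from urllib.parse import urlencode\n",
    "\n",
    "BASE_URL = 'http://reportapi.eastmoney.com/report/list'\n",
    "\n",
    "def build_report_url(page, begin='2019-11-12', end='2021-11-12'):\n",
    "    # Same query parameters as the URL above; only the page number and timestamp vary.\n",
    "    params = {\n",
    "        'cb': 'datatable2678479',\n",
    "        'industryCode': '*',\n",
    "        'pageSize': 50,\n",
    "        'industry': '*',\n",
    "        'rating': '',\n",
    "        'ratingChange': '',\n",
    "        'beginTime': begin,\n",
    "        'endTime': end,\n",
    "        'pageNo': page,\n",
    "        'fields': '',\n",
    "        'qType': 0,\n",
    "        'orgCode': '',\n",
    "        'code': '*',\n",
    "        'rcode': '',\n",
    "        'p': page,\n",
    "        'pageNum': page,\n",
    "        'pageNumber': page,\n",
    "        '_': int(time.time() * 1000),  # millisecond timestamp\n",
    "    }\n",
    "    # safe='*' keeps the literal * wildcard instead of percent-encoding it.\n",
    "    return BASE_URL + '?' + urlencode(params, safe='*')\n",
    "\n",
    "urls = [build_report_url(page) for page in range(1, 10)]  # pages 1 to 9\n",
    "print(urls[0])\n",
    "```\n",
    "\n",
    "Building the query from a dictionary avoids stray whitespace creeping into the URL when a long string literal is split across lines."
   ]
  },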
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'title': '北交所新股申购报告：特瑞斯：天然气行业持续景气，集输设备“小巨人”稳步增长',\n",
       " 'stockName': '特瑞斯',\n",
       " 'stockCode': '834014',\n",
       " 'orgCode': '80000162',\n",
       " 'orgName': '开源证券股份有限公司',\n",
       " 'orgSName': '开源证券',\n",
       " 'publishDate': '2022-11-28 00:00:00.000',\n",
       " 'infoCode': 'AP202211281580639306',\n",
       " 'column': '002004001002',\n",
       " 'predictNextTwoYearEps': '',\n",
       " 'predictNextTwoYearPe': '',\n",
       " 'predictNextYearEps': '',\n",
       " 'predictNextYearPe': '',\n",
       " 'predictThisYearEps': '',\n",
       " 'predictThisYearPe': '',\n",
       " 'predictLastYearEps': '',\n",
       " 'predictLastYearPe': '',\n",
       " 'actualLastTwoYearEps': '',\n",
       " 'actualLastYearEps': '',\n",
       " 'industryCode': '',\n",
       " 'industryName': '',\n",
       " 'emIndustryCode': '',\n",
       " 'indvInduCode': '',\n",
       " 'indvInduName': '',\n",
       " 'emRatingCode': '',\n",
       " 'emRatingValue': '',\n",
       " 'emRatingName': '',\n",
       " 'lastEmRatingCode': '',\n",
       " 'lastEmRatingValue': '',\n",
       " 'lastEmRatingName': '',\n",
       " 'ratingChange': '',\n",
       " 'reportType': 2,\n",
       " 'author': ['11000170965.诸海滨'],\n",
       " 'indvIsNew': '001',\n",
       " 'researcher': '诸海滨',\n",
       " 'newListingDate': '',\n",
       " 'newPurchaseDate': '',\n",
       " 'newIssuePrice': '',\n",
       " 'newPeIssueA': '',\n",
       " 'indvAimPriceT': '',\n",
       " 'indvAimPriceL': '',\n",
       " 'attachType': '0',\n",
       " 'attachSize': 2192,\n",
       " 'attachPages': 25,\n",
       " 'encodeUrl': 'QQdoqp8zl0/fdMk6u57PXTkSOAXkVQIKnnd+QmUtVGk=',\n",
       " 'sRatingName': '',\n",
       " 'sRatingCode': '',\n",
       " 'market': 'BEIJING',\n",
       " 'authorID': ['11000170965'],\n",
       " 'count': 1,\n",
       " 'orgType': 'white'}"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Read the data\n",
    "\n",
    "# Adjacent string literals are joined without the stray whitespace that a backslash continuation inside a string would add\n",
    "urls = (\n",
    "    'http://reportapi.eastmoney.com/report/list?cb=datatable2678479&industryCode=*&pageSize=50&industry=*&'\n",
    "    'rating=&ratingChange=&beginTime=2020-11-12&endTime=2022-11-28&pageNo=%d&fields=&qType=0&orgCode=&code=*&rcode=&p=%d&'\n",
    "    'pageNum=%d&pageNumber=%d&_=%d' % (i, i, i, i, int(round(time.time() * 1000)))\n",
    "    for i in range(1, 10)\n",
    ")\n",
    "\n",
    "headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'}\n",
    "\n",
    "\n",
    "html = requests.get(url=next(urls),headers=headers)\n",
    "\n",
    "# The response is JSONP: drop the 17-character prefix 'datatable2678479(' and the trailing ')' to get plain JSON\n",
    "data_list = json.loads(html.text[17:-1])['data']\n",
    "\n",
    "data_list[0]"
   ]
  },
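  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The slice `html.text[17:-1]` in the cell above works because the response is wrapped in a callback call, `datatable2678479(...)`, whose prefix happens to be 17 characters long. A sketch that does not depend on the length of the callback name is shown below; `parse_jsonp` is a hypothetical helper and the sample string is made up to mimic the shape of the response.\n",
    "\n",
    "```python\n",
    "import json\n",
    "\n",
    "def parse_jsonp(text):\n",
    "    # Keep only what lies between the first '(' and the last ')'.\n",
    "    start = text.index('(') + 1\n",
    "    end = text.rindex(')')\n",
    "    return json.loads(text[start:end])\n",
    "\n",
    "# Made-up response in the same callback(...) shape as the API output.\n",
    "sample = 'datatable2678479({\"data\": [{\"title\": \"demo\", \"stockCode\": \"000002\"}]})'\n",
    "payload = parse_jsonp(sample)\n",
    "print(payload['data'][0]['stockCode'])\n",
    "```\n",
    "\n",
    "With this helper, changing the `cb` parameter in the URL would not require adjusting a hard-coded slice."
   ]
  },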
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>stockName</th>\n",
       "      <th>stockCode</th>\n",
       "      <th>orgSName</th>\n",
       "      <th>emRatingName</th>\n",
       "      <th>publishDate</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>公司深度研究：美好生活系列报告之三——激光显示渔鱼双授，B+C端比翼齐飞</td>\n",
       "      <td>光峰科技</td>\n",
       "      <td>688007</td>\n",
       "      <td>国海证券</td>\n",
       "      <td>买入</td>\n",
       "      <td>2022-11-11</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                  title stockName stockCode orgSName  \\\n",
       "0  公司深度研究：美好生活系列报告之三——激光显示渔鱼双授，B+C端比翼齐飞      光峰科技    688007     国海证券   \n",
       "\n",
       "  emRatingName publishDate  \n",
       "0           买入  2022-11-11  "
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.DataFrame(columns=['title','stockName','stockCode','orgSName','emRatingName','publishDate'])\n",
    "i = 0\n",
    "title,stockName,stockCode,orgSName,emRatingName,publishDate = \\\n",
    "data_list[i]['title'],data_list[i]['stockName'],data_list[i]['stockCode'],data_list[i]['orgSName'],data_list[i]['emRatingName'],data_list[i]['publishDate'][:10]\n",
    "d = {'title':[title],'stockName':[stockName],'stockCode':[stockCode],'orgSName':[orgSName],'emRatingName':[emRatingName],'publishDate':[publishDate]}\n",
    "df = pd.concat([df, pd.DataFrame(d)])\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Complete code\n",
    "\n",
    "import pandas as pd\n",
    "import time\n",
    "import requests\n",
    "import json\n",
    "from sqlalchemy import create_engine \n",
    "\n",
    "\n",
    "#conn_sql = 'mysql+mysqldb://root:local123@127.0.0.1:3306/news?charset=utf8'\n",
    "#conn = create_engine(conn_sql)  \n",
    "\n",
    "# Adjacent string literals are joined without the stray whitespace that a backslash continuation inside a string would add\n",
    "urls = (\n",
    "    'http://reportapi.eastmoney.com/report/list?cb=datatable2678479&industryCode=*&pageSize=50&industry=*&'\n",
    "    'rating=&ratingChange=&beginTime=2020-11-12&endTime=2022-11-12&pageNo=%d&fields=&qType=0&orgCode=&code=*&rcode=&p=%d&'\n",
    "    'pageNum=%d&pageNumber=%d&_=%d' % (i, i, i, i, int(round(time.time() * 1000)))\n",
    "    for i in range(1, 10)\n",
    ")\n",
    "\n",
    "headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'}\n",
    "\n",
    "df = pd.DataFrame(columns=['title','stockName','stockCode','orgSName','emRatingName','publishDate']) # the DataFrame holding the final result must be created outside the loop\n",
    "\n",
    "for url in urls:\n",
    "    html = requests.get(url=url,headers=headers)\n",
    "    if html.status_code==200: # status code 200 means the request succeeded\n",
    "        # Strip the JSONP wrapper 'datatable2678479(' ... ')' before parsing\n",
    "        data_list = json.loads(html.text[17:-1])['data']\n",
    "        # data_list is a list in which every element is a large dictionary.\n",
    "\n",
    "        \n",
    "        for i in range(len(data_list)):\n",
    "            # Extract six fields: title, stock name, stock code, brokerage, rating, and date\n",
    "            title,stockName,stockCode,orgSName,emRatingName,publishDate = \\\n",
    "                data_list[i]['title'],data_list[i]['stockName'],data_list[i]['stockCode'],data_list[i]['orgSName'],data_list[i]['emRatingName'],data_list[i]['publishDate'][:10]\n",
    "            d = {'title':[title],'stockName':[stockName],'stockCode':[stockCode],'orgSName':[orgSName],'emRatingName':[emRatingName],'publishDate':[publishDate]}\n",
    "            # Without ignore_index=True, every appended row keeps index 0\n",
    "            df = pd.concat([df, pd.DataFrame(d)])\n",
    "\n",
    "\n",
    "\n",
    "#pd.io.sql.to_sql(df,'dongfang_research_news', con=conn, schema='news', if_exists = 'append')\n",
    "#print('{} items saved successfully at {}'.format(len(df), time.strftime('%Y-%m-%d %H:%M:%S')))"
   ]
  },
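  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The field extraction in the loop above can also be written as a small helper that returns one dictionary per record, so that the rows are collected in a plain list first. This is a sketch under assumptions: `FIELDS` and `extract_row` are names introduced here, and the record is made up in the shape of one `data_list` element with most keys omitted.\n",
    "\n",
    "```python\n",
    "FIELDS = ['title', 'stockName', 'stockCode', 'orgSName', 'emRatingName', 'publishDate']\n",
    "\n",
    "def extract_row(item):\n",
    "    # Pick the six fields of interest from one record of data_list.\n",
    "    row = {name: item[name] for name in FIELDS}\n",
    "    row['publishDate'] = row['publishDate'][:10]  # keep only YYYY-MM-DD\n",
    "    return row\n",
    "\n",
    "# Made-up record in the shape of one data_list element (most keys omitted).\n",
    "data_list = [{\n",
    "    'title': 'demo report', 'stockName': 'Demo', 'stockCode': '000002',\n",
    "    'orgSName': 'Demo Securities', 'emRatingName': 'Buy',\n",
    "    'publishDate': '2022-11-28 00:00:00.000', 'orgCode': '80000162',\n",
    "}]\n",
    "rows = [extract_row(item) for item in data_list]\n",
    "print(rows[0]['publishDate'])\n",
    "```\n",
    "\n",
    "With pandas available, `pd.DataFrame(rows)` would then build the whole table in one call, which avoids calling `pd.concat` once per row."
   ]
  },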
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>stockName</th>\n",
       "      <th>stockCode</th>\n",
       "      <th>orgSName</th>\n",
       "      <th>emRatingName</th>\n",
       "      <th>publishDate</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>公司深度研究：美好生活系列报告之三——激光显示渔鱼双授，B+C端比翼齐飞</td>\n",
       "      <td>光峰科技</td>\n",
       "      <td>688007</td>\n",
       "      <td>国海证券</td>\n",
       "      <td>买入</td>\n",
       "      <td>2022-11-11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>传统油气能源设备商收购洪田科技，布局电解铜箔设备迎新成长曲线</td>\n",
       "      <td>道森股份</td>\n",
       "      <td>603800</td>\n",
       "      <td>东吴证券</td>\n",
       "      <td>增持</td>\n",
       "      <td>2022-11-11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>首次覆盖报告：东方迎风起，线缆如潮至</td>\n",
       "      <td>东方电缆</td>\n",
       "      <td>603606</td>\n",
       "      <td>东亚前海证券</td>\n",
       "      <td>增持</td>\n",
       "      <td>2022-11-11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>成立三孚北理，研发驱动创新</td>\n",
       "      <td>三孚新科</td>\n",
       "      <td>688359</td>\n",
       "      <td>东亚前海证券</td>\n",
       "      <td>买入</td>\n",
       "      <td>2022-11-11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>商业板块运营稳定，融资支持受益度居前</td>\n",
       "      <td>新城控股</td>\n",
       "      <td>601155</td>\n",
       "      <td>中邮证券</td>\n",
       "      <td>买入</td>\n",
       "      <td>2022-11-11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>盈利能力持续提升，限电与疫情影响短期收入</td>\n",
       "      <td>川仪股份</td>\n",
       "      <td>603100</td>\n",
       "      <td>财通证券</td>\n",
       "      <td>增持</td>\n",
       "      <td>2022-11-07</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Q3扣非高增157%再超预期，接单及盈利延续上行趋势</td>\n",
       "      <td>盛泰集团</td>\n",
       "      <td>605138</td>\n",
       "      <td>天风证券</td>\n",
       "      <td>买入</td>\n",
       "      <td>2022-11-07</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>综合性半导体存储巨头，两大品牌四大产线奠定国际领先地位</td>\n",
       "      <td>江波龙</td>\n",
       "      <td>301308</td>\n",
       "      <td>天风证券</td>\n",
       "      <td>买入</td>\n",
       "      <td>2022-11-07</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2022年三季报点评：锐捷网络分拆在即，期待新品放量带动业绩增长</td>\n",
       "      <td>星网锐捷</td>\n",
       "      <td>002396</td>\n",
       "      <td>民生证券</td>\n",
       "      <td>买入</td>\n",
       "      <td>2022-11-07</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>疫情影响减弱，品牌结构调整和渠道建设助力营收快速增长</td>\n",
       "      <td>百亚股份</td>\n",
       "      <td>003006</td>\n",
       "      <td>天风证券</td>\n",
       "      <td>买入</td>\n",
       "      <td>2022-11-07</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>450 rows × 6 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                   title stockName stockCode orgSName  \\\n",
       "0   公司深度研究：美好生活系列报告之三——激光显示渔鱼双授，B+C端比翼齐飞      光峰科技    688007     国海证券   \n",
       "0         传统油气能源设备商收购洪田科技，布局电解铜箔设备迎新成长曲线      道森股份    603800     东吴证券   \n",
       "0                     首次覆盖报告：东方迎风起，线缆如潮至      东方电缆    603606   东亚前海证券   \n",
       "0                          成立三孚北理，研发驱动创新      三孚新科    688359   东亚前海证券   \n",
       "0                     商业板块运营稳定，融资支持受益度居前      新城控股    601155     中邮证券   \n",
       "..                                   ...       ...       ...      ...   \n",
       "0                   盈利能力持续提升，限电与疫情影响短期收入      川仪股份    603100     财通证券   \n",
       "0             Q3扣非高增157%再超预期，接单及盈利延续上行趋势      盛泰集团    605138     天风证券   \n",
       "0            综合性半导体存储巨头，两大品牌四大产线奠定国际领先地位       江波龙    301308     天风证券   \n",
       "0       2022年三季报点评：锐捷网络分拆在即，期待新品放量带动业绩增长      星网锐捷    002396     民生证券   \n",
       "0             疫情影响减弱，品牌结构调整和渠道建设助力营收快速增长      百亚股份    003006     天风证券   \n",
       "\n",
       "   emRatingName publishDate  \n",
       "0            买入  2022-11-11  \n",
       "0            增持  2022-11-11  \n",
       "0            增持  2022-11-11  \n",
       "0            买入  2022-11-11  \n",
       "0            买入  2022-11-11  \n",
       "..          ...         ...  \n",
       "0            增持  2022-11-07  \n",
       "0            买入  2022-11-07  \n",
       "0            买入  2022-11-07  \n",
       "0            买入  2022-11-07  \n",
       "0            买入  2022-11-07  \n",
       "\n",
       "[450 rows x 6 columns]"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>index</th>\n",
       "      <th>title</th>\n",
       "      <th>stockName</th>\n",
       "      <th>stockCode</th>\n",
       "      <th>orgSName</th>\n",
       "      <th>emRatingName</th>\n",
       "      <th>publishDate</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>10</td>\n",
       "      <td>万物云拟分拆上市，多元业务价值提升</td>\n",
       "      <td>万科A</td>\n",
       "      <td>000002</td>\n",
       "      <td>中泰证券</td>\n",
       "      <td>买入</td>\n",
       "      <td>2021-11-12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>179</td>\n",
       "      <td>动态跟踪：万科物管分拆上市，开启多元化业务价值释放之路</td>\n",
       "      <td>万科A</td>\n",
       "      <td>000002</td>\n",
       "      <td>光大证券</td>\n",
       "      <td>买入</td>\n",
       "      <td>2021-11-08</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>192</td>\n",
       "      <td>万物云拟分拆上市，多元业务价值将逐渐被认可</td>\n",
       "      <td>万科A</td>\n",
       "      <td>000002</td>\n",
       "      <td>东方证券</td>\n",
       "      <td>买入</td>\n",
       "      <td>2021-11-08</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>228</td>\n",
       "      <td>万物云拟分拆上市，物管龙头未来可期</td>\n",
       "      <td>万科A</td>\n",
       "      <td>000002</td>\n",
       "      <td>平安证券</td>\n",
       "      <td>增持</td>\n",
       "      <td>2021-11-07</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>233</td>\n",
       "      <td>10月经营公告点评：单月销售降幅持续收窄，拿地权益比例维持高位</td>\n",
       "      <td>万科A</td>\n",
       "      <td>000002</td>\n",
       "      <td>天风证券</td>\n",
       "      <td>买入</td>\n",
       "      <td>2021-11-07</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>335</td>\n",
       "      <td>销售增速转负，多元业务稳步推进</td>\n",
       "      <td>万科A</td>\n",
       "      <td>000002</td>\n",
       "      <td>中泰证券</td>\n",
       "      <td>买入</td>\n",
       "      <td>2021-11-04</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   index                            title stockName stockCode orgSName  \\\n",
       "0     10                万物云拟分拆上市，多元业务价值提升       万科A    000002     中泰证券   \n",
       "1    179      动态跟踪：万科物管分拆上市，开启多元化业务价值释放之路       万科A    000002     光大证券   \n",
       "2    192            万物云拟分拆上市，多元业务价值将逐渐被认可       万科A    000002     东方证券   \n",
       "3    228                万物云拟分拆上市，物管龙头未来可期       万科A    000002     平安证券   \n",
       "4    233  10月经营公告点评：单月销售降幅持续收窄，拿地权益比例维持高位       万科A    000002     天风证券   \n",
       "5    335                  销售增速转负，多元业务稳步推进       万科A    000002     中泰证券   \n",
       "\n",
       "  emRatingName publishDate  \n",
       "0           买入  2021-11-12  \n",
       "1           买入  2021-11-08  \n",
       "2           买入  2021-11-08  \n",
       "3           增持  2021-11-07  \n",
       "4           买入  2021-11-07  \n",
       "5           买入  2021-11-04  "
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#sql = \"select * from dongfang_research_news where stockCode = '000002'\"\n",
    "#res_pd = pd.read_sql(sql, con=conn, index_col=None)\n",
    "#res_pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.6"
  },
  "vscode": {
   "interpreter": {
    "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
